NSF PAR Search | NSF Public Access Repository

End-to-End Audiovisual Speech Recognition System With Multitask Learning

https://doi.org/10.1109/TMM.2020.2975922

Tao, Fei; Busso, Carlos (January 2021, IEEE Transactions on Multimedia)
Full Text Available
End-to-end audiovisual speech activity detection with bimodal recurrent neural models

https://doi.org/10.1016/j.specom.2019.07.003

Tao, Fei; Busso, Carlos (October 2019, Speech Communication)

Full Text Available
Audiovisual Speech Activity Detection with Advanced Long Short-Term Memory

https://doi.org/10.21437/Interspeech.2018-2490

Tao, Fei; Busso, Carlos (September 2018, Interspeech 2018)

Speech activity detection (SAD) is a key pre-processing step for a speech-based system. The performance of conventional audio-only SAD (A-SAD) systems is impaired by acoustic noise when they are used in practical applications. An alternative approach to address this problem is to include visual information, creating audiovisual speech activity detection (AV-SAD) solutions. In our previous work, we proposed to build an AV-SAD system using bimodal recurrent neural network (BRNN). This framework was able to capture the task-related characteristics in the audio and visual inputs, and model the temporal information within and across modalities. The approach relied on long short-term memory (LSTM). Although LSTM can model longer temporal dependencies with the cells, the effective memory of the units is limited to a few frames, since the recurrent connection only considers the previous frame. For SAD systems, it is important to model longer temporal dependencies to capture the semi-periodic nature of speech conveyed in acoustic and orofacial features. This study proposes to implement a BRNN-based AV-SAD system with advanced LSTMs (A-LSTMs), which overcomes this limitation by including multiple connections to frames in the past. The results show that the proposed framework can significantly outperform the BRNN system trained with the original LSTM layers.
Full Text Available
Aligning Audiovisual Features for Audiovisual Speech Recognition

https://doi.org/10.1109/ICME.2018.8486455

Tao, Fei; Busso, Carlos (July 2018, IEEE International Conference on Multimedia and Expo (ICME))

Full Text Available
Gating Neural Network for Large Vocabulary Audiovisual Speech Recognition

https://doi.org/10.1109/TASLP.2018.2815268

Tao, Fei; Busso, Carlos (July 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

Full Text Available
Audiovisual Speech Activity Detection with Advanced Long Short-Term Memory

Tao, Fei and (January 2018, Interspeech 2018)

Speech activity detection (SAD) is a key pre-processing step for a speech-based system. The performance of conventional audio-only SAD (A-SAD) systems is impaired by acoustic noise when they are used in practical applications. An alternative approach to address this problem is to include visual information, creating audiovisual speech activity detection (AV-SAD) solutions. In our previous work, we proposed to build an AV-SAD system using bimodal recurrent neural network (BRNN). This framework was able to capture the task-related characteristics in the audio and visual inputs, and model the temporal information within and across modalities. The approach relied on long short-term memory (LSTM). Although LSTM can model longer temporal dependencies with the cells, the effective memory of the units is limited to a few frames, since the recurrent connection only considers the previous frame. For SAD systems, it is important to model longer temporal dependencies to capture the semi-periodic nature of speech conveyed in acoustic and orofacial features. This study proposes to implement a BRNN-based AV-SAD system with advanced LSTMs (A-LSTMs), which overcomes this limitation by including multiple connections to frames in the past. The results show that the proposed framework can significantly outperform the BRNN system trained with the original LSTM layers.
Full Text Available
